LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked
Large language models (LLMs) have skyrocketed in popularity in recent years
due to their ability to generate high-quality text in response to human
prompting. However, these models have also been shown to generate harmful
content in response to user prompting (e.g., giving users instructions on how
to commit crimes). Much of the literature focuses on mitigating these risks
through methods such as aligning models with human values via reinforcement
learning. Yet even
aligned language models are susceptible to adversarial attacks that bypass
their restrictions on generating harmful text. We propose a simple approach to
defending against these attacks by having a large language model filter its own
responses. Our current results show that even if a model is not fine-tuned to
be aligned with human values, it is possible to stop it from presenting harmful
content to users by validating the content with a language model.
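
To make the self-filtering loop concrete, here is a minimal Python sketch. It assumes a generic `generate_fn` that maps a prompt string to the model's text completion; the filter prompt and refusal message below are illustrative stand-ins, not the paper's exact wording:

```python
from typing import Callable

# Illustrative harm-screening prompt; the paper's exact wording may differ.
HARM_FILTER_PROMPT = (
    "Does the following text contain harmful content? Answer only "
    "'Yes, this is harmful' or 'No, this is not harmful'.\n\nText: {response}"
)

def respond_with_self_defense(user_prompt: str,
                              generate_fn: Callable[[str], str]) -> str:
    """Generate a response, then have the same LLM screen it before returning."""
    candidate = generate_fn(user_prompt)
    verdict = generate_fn(HARM_FILTER_PROMPT.format(response=candidate))
    if verdict.strip().lower().startswith("yes"):
        # The filter flagged the text: withhold it from the user.
        return "Sorry, I can't help with that."
    return candidate
```

Because the filter only reads the generated text, it can wrap any underlying model, aligned or not, which is what lets the approach work without fine-tuning.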
Robust Principles: Architectural Design Principles for Adversarially Robust CNNs
Our research aims to unify existing works' diverging opinions on how
architectural components affect the adversarial robustness of CNNs. To
accomplish our goal, we synthesize a suite of three generalizable robust
architectural design principles: (a) an optimal range for depth and width
configurations, (b) preferring a convolutional stem stage over a patchify one,
and (c) a robust residual block design that adopts squeeze-and-excitation
blocks and non-parametric smooth activation functions (sketched in code after
this abstract). Through extensive experiments
across a wide spectrum of dataset scales, adversarial training methods, model
parameters, and network design spaces, our principles consistently and markedly
improve AutoAttack accuracy by 1-3 percentage points (pp) on CIFAR-10 and
CIFAR-100, and by 4-9 pp on ImageNet. The code is publicly available at
https://github.com/poloclub/robust-principles.
Comment: Published at BMVC'23
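
To make principles (b) and (c) concrete, here is a minimal PyTorch sketch (not the authors' released code): the layer sizes, the reduction ratio, and the choice of SiLU as the non-parametric smooth activation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Principle (b): a convolutional stem in place of a patchify stem."""
    def __init__(self, in_ch: int = 3, out_ch: int = 64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.stem(x)

class SqueezeExcite(nn.Module):
    """Squeeze-and-excitation: global pool, bottleneck MLP, channel gating."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.SiLU(),                        # smooth and non-parametric
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scale = self.gate(x.mean(dim=(2, 3)))    # squeeze to B x C
        return x * scale[:, :, None, None]       # excite: rescale channels

class RobustResidualBlock(nn.Module):
    """Principle (c): residual block with SE and smooth activations."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            SqueezeExcite(channels),
        )
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(x + self.body(x))

# Smoke test: a stem followed by one block preserves the channel count.
if __name__ == "__main__":
    x = torch.randn(2, 3, 32, 32)
    y = RobustResidualBlock(64)(ConvStem()(x))
    print(y.shape)  # torch.Size([2, 64, 16, 16])
```

SiLU is used here because it is smooth and has no learnable parameters, matching the "non-parametric smooth activation" criterion; other activations with those two properties would fit the principle equally well.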